2.4 Bellman equation

We now introduce the Bellman equation, a mathematical tool for analyzing state values.

In a nutshell, the Bellman equation is a set of linear equations that describes the relationships among the values of all the states.

We next derive the Bellman equation. First, note that Gt can be rewritten as

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \\
&= R_{t+1} + \gamma\,(R_{t+2} + \gamma R_{t+3} + \cdots) \\
&= R_{t+1} + \gamma G_{t+1},
\end{aligned}
$$

where $G_{t+1} = R_{t+2} + \gamma R_{t+3} + \cdots$. This equation establishes the relationship between $G_t$ and $G_{t+1}$. Then, the state value can be written as

$$
\begin{aligned}
v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s].
\end{aligned}
\tag{2.4}
$$
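
As a quick sanity check of the decomposition $G_t = R_{t+1} + \gamma G_{t+1}$, the following minimal sketch (not from the text; the reward sequence and discount factor are arbitrary placeholders) computes the returns of a finite reward sequence both directly from the definition and backward through the recursion, and confirms that the two agree.

```python
# Minimal sanity check of G_t = R_{t+1} + gamma * G_{t+1}.
# The reward sequence and gamma below are arbitrary placeholders.
gamma = 0.9
rewards = [0.0, 1.0, 1.0, 1.0, 1.0]   # hypothetical rewards R_1, R_2, ..., R_5

# Direct definition: G_t = sum_k gamma^k * R_{t+k+1}
direct = [sum(gamma**k * r for k, r in enumerate(rewards[t:]))
          for t in range(len(rewards))]

# Recursion G_t = R_{t+1} + gamma * G_{t+1}, evaluated backward,
# with the return after the last reward taken as zero.
recursive, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    recursive.append(g)
recursive.reverse()

assert all(abs(a - b) < 1e-12 for a, b in zip(direct, recursive))
```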

The two terms in (2.4) are analyzed below: the first is the expected immediate reward, and the second is the expected future return discounted by $\gamma$.

The first term, $\mathbb{E}[R_{t+1} \mid S_t = s]$, is the expectation of the immediate rewards. By using the law of total expectation (Appendix A), it can be calculated by conditioning on the action taken, where $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$:

$$
\begin{aligned}
\mathbb{E}[R_{t+1} \mid S_t = s]
&= \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} p(r \mid s, a)\, r.
\end{aligned}
\tag{2.5}
$$

Here, $\mathcal{A}$ and $\mathcal{R}$ are the sets of possible actions and rewards, respectively. It should be noted that $\mathcal{A}$ may be different for different states. In this case, $\mathcal{A}$ should be written as $\mathcal{A}(s)$. Similarly, $\mathcal{R}$ may also depend on $(s,a)$. We drop the dependence on $s$ or $(s,a)$ for the sake of simplicity in this book. Nevertheless, the conclusions are still valid in the presence of such dependence.
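
To make (2.5) concrete, the short sketch below evaluates the immediate-reward term for a single state with two actions; the policy and reward probabilities are made-up numbers, not taken from any example in this chapter.

```python
# Evaluate E[R_{t+1} | S_t = s] = sum_a pi(a|s) sum_r p(r|s,a) * r
# for one state, using made-up (hypothetical) probabilities.
pi_s = {"a1": 0.5, "a2": 0.5}                            # pi(a|s)
p_r = {"a1": {0.0: 1.0}, "a2": {-1.0: 0.2, 1.0: 0.8}}    # p(r|s,a)

expected_immediate = sum(
    pi_s[a] * sum(prob * r for r, prob in p_r[a].items())
    for a in pi_s
)
print(expected_immediate)   # 0.5*0 + 0.5*(-1*0.2 + 1*0.8) = 0.3
```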

The second term, $\mathbb{E}[G_{t+1} \mid S_t = s]$, is the expectation of the future rewards. It can be calculated by conditioning on the next state: for each possible next state $s'$, the probability of transitioning from $s$ to $s'$ is multiplied by the expected discounted return obtained from $s'$:

$$
\begin{aligned}
\mathbb{E}[G_{t+1} \mid S_t = s]
&= \sum_{s' \in \mathcal{S}} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s' \in \mathcal{S}} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) \quad \text{(due to the Markov property)} \\
&= \sum_{s' \in \mathcal{S}} v_\pi(s')\, p(s' \mid s) \\
&= \sum_{s' \in \mathcal{S}} v_\pi(s') \sum_{a \in \mathcal{A}} p(s' \mid s, a)\, \pi(a \mid s).
\end{aligned}
\tag{2.6}
$$

The above derivation uses the fact that $\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$, which is due to the Markov property: the future rewards depend merely on the present state rather than the previous ones.

Substituting (2.5)-(2.6) into (2.4) yields

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s] \\
&= \underbrace{\sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} p(r \mid s, a)\, r}_{\text{mean of immediate rewards}}
 + \underbrace{\gamma \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, v_\pi(s')}_{\text{mean of future rewards}} \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid s) \left[ \sum_{r \in \mathcal{R}} p(r \mid s, a)\, r + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, v_\pi(s') \right], \quad \text{for all } s \in \mathcal{S}.
\end{aligned}
\tag{2.7}
$$

This equation is the Bellman equation, which characterizes the relationships of state values. It is a fundamental tool for designing and analyzing reinforcement learning algorithms.
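
As an illustration of the structure of (2.7), the sketch below evaluates its right-hand side for one state of a tabular problem. The dictionary-based representation of $\pi(a \mid s)$, $p(r \mid s,a)$, and $p(s' \mid s,a)$ is an assumption made for this sketch, not notation from the text.

```python
# Evaluate the right-hand side of the Bellman equation (2.7) at a state s:
#   sum_a pi(a|s) [ sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s') ]
# pi, p_r, p_s are assumed tabular models stored as dictionaries (hypothetical format).
def bellman_rhs(s, v, pi, p_r, p_s, gamma):
    total = 0.0
    for a, pa in pi[s].items():                                        # pi(a|s)
        immediate = sum(prob * r for r, prob in p_r[(s, a)].items())   # sum_r p(r|s,a) r
        future = sum(prob * v[s2] for s2, prob in p_s[(s, a)].items()) # sum_s' p(s'|s,a) v(s')
        total += pa * (immediate + gamma * future)
    return total
```

A dictionary `v` of state values solves the Bellman equation exactly when `v[s] == bellman_rhs(s, v, pi, p_r, p_s, gamma)` holds for every state `s`; algorithms for finding such a `v` are discussed in Section 2.7.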

The Bellman equation may seem complex at first glance, but it has a clear structure: for every state, the value equals the mean of the immediate rewards plus $\gamma$ times the mean of the values of the next states, and there is one such equation for every state. Some remarks are given below.

In addition to the expression in (2.7), readers may also encounter other expressions of the Bellman equation in the literature. We next introduce two equivalent expressions.
First, it follows from the law of total probability that

$$
p(s' \mid s, a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a), \qquad
p(r \mid s, a) = \sum_{s' \in \mathcal{S}} p(s', r \mid s, a).
$$

Then, equation (2.7) can be rewritten as

$$
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a)\,\bigl[r + \gamma v_\pi(s')\bigr].
$$
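
Since this rewriting relies only on the marginalization identities above, it can also be checked numerically. The sketch below uses made-up probabilities for a single state-action pair (the values `v` and `p_joint` are placeholders, not from the text) and confirms that the joint-distribution form agrees with the two separate sums in (2.7).

```python
# Check that sum_{s',r} p(s',r|s,a)[r + gamma v(s')] equals
# sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s') for one fixed (s, a).
gamma = 0.9
v = {"s1": 1.0, "s2": 2.0}                         # placeholder state values
p_joint = {("s1", 0.0): 0.3, ("s1", 1.0): 0.2,     # p(s', r | s, a), made-up numbers
           ("s2", 0.0): 0.1, ("s2", 1.0): 0.4}

# Marginals: p(s'|s,a) = sum_r p(s',r|s,a) and p(r|s,a) = sum_s' p(s',r|s,a)
p_next, p_rew = {}, {}
for (s2, r), prob in p_joint.items():
    p_next[s2] = p_next.get(s2, 0.0) + prob
    p_rew[r] = p_rew.get(r, 0.0) + prob

joint_form = sum(prob * (r + gamma * v[s2]) for (s2, r), prob in p_joint.items())
split_form = (sum(prob * r for r, prob in p_rew.items())
              + gamma * sum(prob * v[s2] for s2, prob in p_next.items()))
assert abs(joint_form - split_form) < 1e-12
```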

Second, in some problems the reward $r$ may depend solely on the next state $s'$. In that case, we can write the reward as $r(s')$ and hence $p(r(s') \mid s, a) = p(s' \mid s, a)$; substituting this into (2.7) gives

$$
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\,\bigl[r(s') + \gamma v_\pi(s')\bigr].
$$

2.5 Examples for illustrating the Bellman equation

We next use two examples to demonstrate how to write out the Bellman equation and calculate the state values step by step. Readers are advised to carefully go through the examples to gain a better understanding of the Bellman equation.

Figure 2.4: An example for demonstrating the Bellman equation. The policy in this example is deterministic.

Consider the first example shown in Figure 2.4, where the policy is deterministic. We next write out the Bellman equation and then solve the state values from it.

First, consider state $s_1$. Under the policy, the probabilities of taking the actions are $\pi(a = a_3 \mid s_1) = 1$ and $\pi(a \neq a_3 \mid s_1) = 0$. The state transition probabilities are $p(s' = s_3 \mid s_1, a_3) = 1$ and $p(s' \neq s_3 \mid s_1, a_3) = 0$. The reward probabilities are $p(r = 0 \mid s_1, a_3) = 1$ and $p(r \neq 0 \mid s_1, a_3) = 0$. Substituting these values into (2.7) gives

$$
v_\pi(s_1) = 0 + \gamma v_\pi(s_3).
$$

Interestingly, although the expression of the Bellman equation in (2.7) seems complex, the expression for this specific state is very simple.

Similarly, it can be obtained that

$$
v_\pi(s_2) = 1 + \gamma v_\pi(s_4), \qquad
v_\pi(s_3) = 1 + \gamma v_\pi(s_4), \qquad
v_\pi(s_4) = 1 + \gamma v_\pi(s_4).
$$

We can solve the state values from these equations. Since the equations are simple, we can manually solve them. More complicated equations can be solved by the algorithms presented in Section 2.7. Here, the state values can be solved as

$$
v_\pi(s_4) = \frac{1}{1-\gamma}, \qquad
v_\pi(s_3) = \frac{1}{1-\gamma}, \qquad
v_\pi(s_2) = \frac{1}{1-\gamma}, \qquad
v_\pi(s_1) = \frac{\gamma}{1-\gamma}.
$$

Furthermore, if we set γ=0.9 , then

$$
v_\pi(s_4) = \frac{1}{1-0.9} = 10, \qquad
v_\pi(s_3) = \frac{1}{1-0.9} = 10, \qquad
v_\pi(s_2) = \frac{1}{1-0.9} = 10, \qquad
v_\pi(s_1) = \frac{0.9}{1-0.9} = 9.
$$
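
The same answer can be obtained by writing the four equations above in matrix-vector form and solving the resulting linear system. The sketch below (an illustration, not code from the text) does this with NumPy for $\gamma = 0.9$.

```python
# Solve the four Bellman equations of this example as the linear system
# (I - gamma * P_pi) v = r_pi, where the rows of P_pi follow the deterministic policy.
import numpy as np

gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0, 1.0])   # immediate rewards under the policy for s1..s4
P_pi = np.array([[0, 0, 1, 0],          # s1 -> s3
                 [0, 0, 0, 1],          # s2 -> s4
                 [0, 0, 0, 1],          # s3 -> s4
                 [0, 0, 0, 1]])         # s4 -> s4

v = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v)   # [ 9. 10. 10. 10.]
```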

Figure 2.5: An example for demonstrating the Bellman equation. The policy in this example is stochastic.

Consider the second example shown in Figure 2.5, where the policy is stochastic. We next write out the Bellman equation and then solve the state values from it.

In state $s_1$, the probabilities of going right and down both equal 0.5. Mathematically, we have $\pi(a = a_2 \mid s_1) = 0.5$ and $\pi(a = a_3 \mid s_1) = 0.5$. The state transition probability is deterministic since $p(s' = s_3 \mid s_1, a_3) = 1$ and $p(s' = s_2 \mid s_1, a_2) = 1$. The reward probability is also deterministic since $p(r = 0 \mid s_1, a_3) = 1$ and $p(r = -1 \mid s_1, a_2) = 1$. Substituting these values into (2.7) gives

$$
v_\pi(s_1) = 0.5\bigl[0 + \gamma v_\pi(s_3)\bigr] + 0.5\bigl[-1 + \gamma v_\pi(s_2)\bigr].
$$

Similarly, it can be obtained that

$$
v_\pi(s_2) = 1 + \gamma v_\pi(s_4), \qquad
v_\pi(s_3) = 1 + \gamma v_\pi(s_4), \qquad
v_\pi(s_4) = 1 + \gamma v_\pi(s_4).
$$

The state values can be solved from the above equations. Since the equations are simple, we can solve them manually and obtain

$$
v_\pi(s_4) = \frac{1}{1-\gamma}, \qquad
v_\pi(s_3) = \frac{1}{1-\gamma}, \qquad
v_\pi(s_2) = \frac{1}{1-\gamma},
$$
$$
v_\pi(s_1) = 0.5\bigl[0 + \gamma v_\pi(s_3)\bigr] + 0.5\bigl[-1 + \gamma v_\pi(s_2)\bigr]
= -0.5 + \frac{\gamma}{1-\gamma}.
$$

Furthermore, if we set γ=0.9 , then

$$
v_\pi(s_4) = 10, \qquad
v_\pi(s_3) = 10, \qquad
v_\pi(s_2) = 10, \qquad
v_\pi(s_1) = -0.5 + 9 = 8.5.
$$
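
The stochastic policy can be handled with the same linear-system sketch; only the first row of the transition matrix and the first entry of the reward vector change (again illustrative code, not from the text).

```python
# Same linear-system approach for the stochastic policy of this example.
import numpy as np

gamma = 0.9
r_pi = np.array([0.5 * 0 + 0.5 * (-1), 1.0, 1.0, 1.0])   # expected immediate rewards
P_pi = np.array([[0, 0.5, 0.5, 0],   # s1 -> s2 or s3, each with probability 0.5
                 [0, 0,   0,   1],   # s2 -> s4
                 [0, 0,   0,   1],   # s3 -> s4
                 [0, 0,   0,   1]])  # s4 -> s4

v = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v)   # [ 8.5 10. 10. 10.]
```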

If we compare the state values of the two policies in the above examples, denoting them as $v_{\pi_1}$ for the policy in Figure 2.4 and $v_{\pi_2}$ for the policy in Figure 2.5, it can be seen that

$$
v_{\pi_1}(s_i) \geq v_{\pi_2}(s_i), \qquad i = 1, 2, 3, 4,
$$

which indicates that the policy in Figure 2.4 is better because it has greater state values. This mathematical conclusion is consistent with the intuition that the first policy is better because it can avoid entering the forbidden area when the agent starts from s1 . As a result, the above two examples demonstrate that state values can be used to evaluate policies.